import pandas as pd
import numpy as np
kakamana
January 23, 2023
This section covers a variety of aspects of feature engineering: how to use the features already present in a dataset to create new, more useful features, and how to encode, aggregate, and extract information from both numerical and textual features.
This Feature Engineering section is part of the DataCamp course Preprocessing for Machine Learning in Python.
This post records my learning experience of data science through DataCamp.
Take an exploratory look at the `volunteer` dataset, using the variable of that name. Which of the following columns would you want to perform a feature engineering task on?
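A minimal sketch of how the data might be loaded and inspected (the CSV path is the one used later in this post):
volunteer = pd.read_csv('dataset/volunteer_opportunities.csv')
volunteer.head()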
| | opportunity_id | content_id | vol_requests | event_time | title | hits | summary | is_priority | category_id | category_desc | ... | end_date_date | status | Latitude | Longitude | Community Board | Community Council | Census Tract | BIN | BBL | NTA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4996 | 37004 | 50 | 0 | Volunteers Needed For Rise Up & Stay Put! Home... | 737 | Building on successful events last summer and ... | NaN | NaN | NaN | ... | July 30 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 5008 | 37036 | 2 | 0 | Web designer | 22 | Build a website for an Afghan business | NaN | 1.0 | Strengthening Communities | ... | February 01 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 5016 | 37143 | 20 | 0 | Urban Adventures - Ice Skating at Lasker Rink | 62 | Please join us and the students from Mott Hall... | NaN | 1.0 | Strengthening Communities | ... | January 29 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 5022 | 37237 | 500 | 0 | Fight global hunger and support women farmers ... | 14 | The Oxfam Action Corps is a group of dedicated... | NaN | 1.0 | Strengthening Communities | ... | March 31 2012 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 5055 | 37425 | 15 | 0 | Stop 'N' Swap | 31 | Stop 'N' Swap reduces NYC's waste by finding n... | NaN | 4.0 | Environment | ... | February 05 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 35 columns
Take a look at the `hiking` dataset. There are several columns here that need encoding before they can be modeled, one of which is the `Accessible` column. `Accessible` is a binary feature with two values, `Y` or `N`, so it needs to be encoded into 1s and 0s. Use scikit-learn’s `LabelEncoder` to do that transformation.
| | Prop_ID | Name | Location | Park_Name | Length | Difficulty | Other_Details | Accessible | Limited_Access | lat | lon |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | B057 | Salt Marsh Nature Trail | Enter behind the Salt Marsh Nature Center, loc... | Marine Park | 0.8 miles | None | <p>The first half of this mile-long trail foll... | Y | N | NaN | NaN |
| 1 | B073 | Lullwater | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 1.0 mile | Easy | Explore the Lullwater to see how nature thrive... | N | N | NaN | NaN |
| 2 | B073 | Midwood | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.75 miles | Easy | Step back in time with a walk through Brooklyn... | N | N | NaN | NaN |
| 3 | B073 | Peninsula | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.5 miles | Easy | Discover how the Peninsula has changed over th... | N | N | NaN | NaN |
| 4 | B073 | Waterfall | Enter Park at Lincoln Road and Ocean Avenue en... | Prospect Park | 0.5 miles | Easy | Trace the source of the Lake on the Waterfall ... | N | N | NaN | NaN |
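A minimal sketch of the encoding step, assuming the DataFrame is already loaded as `hiking`:
from sklearn.preprocessing import LabelEncoder
# Set up the LabelEncoder object
enc = LabelEncoder()
# Apply the encoding to the 'Accessible' column
hiking['Accessible_enc'] = enc.fit_transform(hiking['Accessible'])
# Compare the two columns
hiking[['Accessible', 'Accessible_enc']].head()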
| | Accessible | Accessible_enc |
| --- | --- | --- |
| 0 | Y | 1 |
| 1 | N | 0 |
| 2 | N | 0 |
| 3 | N | 0 |
| 4 | N | 0 |
One of the columns in the `volunteer` dataset, `category_desc`, gives category descriptions for the volunteer opportunities listed. Because it is a categorical variable with more than two categories, we need one-hot encoding to represent this column numerically. Use pandas’ `get_dummies()` function to do so.
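A sketch of that transformation (`category_enc` is a name chosen here for illustration); the printed head of the encoded frame follows:
# Transform the category_desc column into one-hot encoded columns
category_enc = pd.get_dummies(volunteer['category_desc'])
# Take a look at the encoded columns
print(category_enc.head())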
Education Emergency Preparedness Environment Health \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 1 0
Helping Neighbors in Need Strengthening Communities
0 0 0
1 0 1
2 0 1
3 0 1
4 0 0
A good use case for creating a new feature from an aggregate statistic is taking the mean across several columns. Here, you have a DataFrame of running times named `running_times_5k`. For each `name` in the dataset, take the mean of their five run times.
| | name | run1 | run2 | run3 | run4 | run5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Sue | 20.1 | 18.5 | 19.6 | 20.3 | 18.3 |
| 1 | Mark | 16.5 | 17.1 | 16.9 | 17.6 | 17.3 |
| 2 | Sean | 23.5 | 25.1 | 25.2 | 24.6 | 23.9 |
| 3 | Erin | 21.7 | 21.1 | 20.9 | 22.1 | 22.2 |
| 4 | Jenny | 25.8 | 27.1 | 26.1 | 26.7 | 26.9 |
| 5 | Russell | 30.9 | 29.6 | 31.4 | 30.4 | 29.9 |
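One way to sketch this, using pandas’ row-wise mean (`axis=1`):
# Columns holding the individual run times
run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']
# Take the row-wise mean across the five runs
running_times_5k['mean'] = running_times_5k[run_columns].mean(axis=1)
print(running_times_5k)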
name run1 run2 run3 run4 run5 mean
0 Sue 20.1 18.5 19.6 20.3 18.3 19.36
1 Mark 16.5 17.1 16.9 17.6 17.3 17.08
2 Sean 23.5 25.1 25.2 24.6 23.9 24.46
3 Erin 21.7 21.1 20.9 22.1 22.2 21.60
4 Jenny 25.8 27.1 26.1 26.7 26.9 26.52
5 Russell 30.9 29.6 31.4 30.4 29.9 30.44
Several columns in the `volunteer` dataset contain datetimes. Let’s take a look at the `start_date_date` column and extract just the month to use as a feature for modeling.
# First, convert string column to date column
volunteer['start_date_converted'] = pd.to_datetime(volunteer['start_date_date'])
# Extract just the month from the converted column
volunteer['start_date_month'] = volunteer['start_date_converted'].apply(lambda row: row.month)
# Take a look at the converted and new month columns
volunteer[['start_date_converted', 'start_date_month']].head()
| | start_date_converted | start_date_month |
| --- | --- | --- |
| 0 | 2011-07-30 | 7 |
| 1 | 2011-02-01 | 2 |
| 2 | 2011-01-29 | 1 |
| 3 | 2011-02-14 | 2 |
| 4 | 2011-02-05 | 2 |
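As an aside, pandas’ vectorized `.dt` accessor gives the same result without a per-row lambda:
# Equivalent, vectorized month extraction
volunteer['start_date_month'] = volunteer['start_date_converted'].dt.month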
The `Length` column in the `hiking` dataset is a column of strings, but each string contains the mileage of the hike. We’re going to extract this mileage using regular expressions, then use a lambda in pandas to apply the extraction to the DataFrame.
import re

# Write a pattern to extract numbers and decimals
def return_mileage(length):
    if length is None:
        return None
    pattern = re.compile(r'\d+\.\d+')
    # Search the text for matches
    mile = re.match(pattern, length)
    # If a value is returned, use group(0) to return the found value
    if mile is not None:
        return float(mile.group(0))
# Apply the function to the Length column and take a look at both columns
hiking['Length_num'] = hiking['Length'].apply(lambda row: return_mileage(row))
hiking[['Length', 'Length_num']].head()
| | Length | Length_num |
| --- | --- | --- |
| 0 | 0.8 miles | 0.80 |
| 1 | 1.0 mile | 1.00 |
| 2 | 0.75 miles | 0.75 |
| 3 | 0.5 miles | 0.50 |
| 4 | 0.5 miles | 0.50 |
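Note that the pattern `\d+\.\d+` only matches lengths written with a decimal point, which holds for the rows shown here; if the column also contained whole-number strings such as "2 miles", a pattern like `\d+(?:\.\d+)?` would be needed to capture them.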
Let’s transform the `volunteer` dataset’s `title` column into a text vector, to use in a prediction task in the next exercise.
from sklearn.feature_extraction.text import TfidfVectorizer
# Reload the data and drop rows missing category_desc, so train_test_split can stratify on it later
volunteer = pd.read_csv('dataset/volunteer_opportunities.csv')
volunteer = volunteer.dropna(subset=['category_desc'], axis=0)
# Take the title text
title_text = volunteer['title']
# Create the vectorizer method
tfidf_vec = TfidfVectorizer()
# Transform the text into tf-idf vectors
text_tfidf = tfidf_vec.fit_transform(title_text)
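The result is a sparse matrix with one row per title and one column per term in the fitted vocabulary; a quick check of its dimensions:
# One row per title, one column per vocabulary term
print(text_tfidf.shape)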
Now that we’ve encoded the `volunteer` dataset’s `title` column into tf-idf vectors, let’s use those vectors to try to predict the `category_desc` column.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
# Split the dataset according to the class distribution of category_desc
y = volunteer['category_desc']
X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)
# Fit the model to the training data
nb.fit(X_train, y_train)
# Print out the model's accuracy
print(nb.score(X_test, y_test))
0.5032258064516129
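Around 50% accuracy may look modest, but with six `category_desc` classes it is well above chance. Note that `GaussianNB` requires dense input, hence the `toarray()` call; for sparse tf-idf features, `MultinomialNB` or a linear model would typically be a more natural fit.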